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Abstract To find efficient screening methods for high dimensional hnear regression mod- 
els, this paper studies the relationship between model fitting and screening performance. 
Under a sparsity assumption, we show that a subset that includes the true submodel always 
yields smaller residual sum of squares (i.e., has better model fitting) than all that do not in a 
general asymptotic setting. This indicates that, for screening important variables, we could 
follow a "better fitting, better screening" rule, i.e., pick a "better" subset that has better 
model fitting. To seek such a better subset, we consider the optimization problem associated 
with best subset regression. An EM algorithm, called orthogonalizing subset screening, and 
its accelerating version are proposed for searching for the best subset. Although the two 
algorithms cannot guarantee that a subset they yield is the best, their monotonicity prop- 
erty makes the subset have better model fitting than initial subsets generated by popular 
screening methods, and thus the subset can have better screening performance asymptoti- 
cally. Simulation results show that our methods are very competitive in high dimensional 
variable screening even for finite sample sizes. 

KEY WORDS: Best subset regression; Combinatorial optimization; Dimensionality reduc- 
tion; EM algorithm; Orthogonal design; Sure screening property; Variable selection. 
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1 Introduction 



Regression problems with large numbers of candidate predictive variables occur in a wide 
variety of scientific fields, and then become increasingly important in statistical research. 
Suppose that there are p predictive variables Xi, . . . , Xp. Consider a linear regression model 

y = X/3 + £, (1) 

where X = (xjj) is the nx p regression matrix, y = . . . , ?/„)' G M" is the response vector, 
f3 = . . . , /3p)' is the vector of regression coefficients corresponding to Xi,...,Xp and 
£ = {ei, . . . jEn)' is a vector of independent and identically distributed random errors with 
zero mean and finite variance a^. Without loss of generality, assume that X is standardized 
with Er=i3;ij = and Er=i ^^iji = ELi ^^^ja jJi^h e and that y is 

centred with J2i=iyi = 0- Throughout this paper, we denote the full model {1,. . . ,p} by 
Zp. For A C Zp, X_4 denotes the submatrix of X corresponding to A. For 6 G MP, 0^ 
denotes the subvector of corresponding to A. For a vector x, ||x|| denotes its Euclidean 
norm. For a set S, \S\ denotes its cardinality. 

With a large number of variables in ([T]), model interpretability becomes important in 
statistical applications. We often would like to eliminate the least important variables for 
determining a smaller subset that exhibit the strongest effects. An increasing number of 
papers have studied on ([T]) with the sparsity assumption that only a small number of variables 
among Xi, . . . ,Xp contribute to the response. If the underlying model is actually sparse, 
the prediction accuracy can be improved by effectively identifying the subset of important 
variables. When p is much larger than n, Fan and Lv (2008) proposed a two-stage procedure 
for estimating the sparse parameter f3. In the first stage, a screening approach is applied to 
pick M variables, where M < is a specified number. In the second stage, the coefficients 
in the screened M-dimensional submodel can be estimated by well-developed regression 
techniques for situations where the variables are fewer than the observations. To guarantee 
the effectiveness of this procedure, the screening approach used in the first stage should 
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possess the sure screening property, i.e., it retains all important variables in the model 
asymptotically (Fan and Lv 2008). Several screening approaches have been studied in the 
literature; see Fan and Lv (2008), Wang (2009), and Biihlmann and van de Geer (2011) 
among others. 

This paper aims to provide some new viewpoints on variable screening when p is much 
larger than n. We first investigate the relationship between model fitting and screening 
performance. Here model fitting of a submodel is described by the magnitude of the (residual) 
sum of squares it yields. Small sum of squares corresponds to good model fitting. Consider 
the following question: if a submodel has better model fitting, can we say that the submodel 
is more likely to include all important variables? Interestingly, the answer is "yes" in a 
general asymptotic setting. The answer provides us a rule to screen variables, i.e., we should 
pick a submodel with good model fitting. We call this rule "better fitting, better screening" . 
To make it clear, let denote the true submodel {j e Zp : /3j ^ 0} with d— \Aq\. With 
a specified M ^ d, let 2lo and 2li denote the sets {A C Zp : |^| = M, C A} and 
{A C Zp : \A\ = M, Ao\A^ 0}, respectively. The "better fitting, better screening" rule 
tells us that the sum of squares from a submodel T G 2lo is asymptotically smaller than that 
from any ^ e 2li under regularity conditions. In other words, a submodel can include 
asymptotically if it is better than other submodels in the sense of model fitting. The 
ratio of these "better" subsets to all M-subsets is |2lo|/|Zp|. 

In practise, how do we find one of these better subsets? Let us consider the following 
optimization problem 

mm ||y - X/Sf subject to ||/3||o ^ M, (2) 

where || • ||o denotes the ig norm that refers to the number of nonzero components. This 
problem is equivalent to a combinatorial optimization problem 

min ||y — X_4;9_4||^ subject to |^| = M, 
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where is the least squares estimator under the submodel A. Therefore the solution 
to (E]) yields the best subset of size M. From the above discuss, the best subset clearly 
belongs to the set of all the "better" subsets, and thus possesses the sure screening property. 
Consequently, we can search for better subsets using efficient algorithms for ([2]). Even though 
such algorithms seldom reach the (global) solution, local solutions with small sums of squares, 
which have good screening performance as well, can be obtained. 

For small p, an exhaustive search over all possible subsets can be used to solve (E]). 
A branch-and-bound strategy has been developed to reduce the number of subsets being 
searched; see Beale, Kendall and Mann (1967), Hocking and Leslie (1967), LaMotte and 
Hocking (1970), Furnival and Wilson (1974) and Narendra and Fukunaga (1977). Some 
later improvements can be found in Gatu and Kontoghiorghes (2006) and references therein. 
When p is moderate or large, such subset searches are infeasible. Some simplified procedures 
like forward stepwise selection (abbreviated as FS) can be used to give sub-optimal solutions; 
see e.g. Miller (2002). Note that the solution to (|2]) has a closed form when the regression 
matrix X is (column) orthogonal. In Section [3] we provide an EM algorithm to solve ([2]). The 
basic idea behind this algorithm is active orthogonalization (Xiong, Dai and Qian, 2011), 
which embeds the original problem into a missing data problem with a larger orthogonal 
regression matrix. We call this algorithm orthogonalizing subset screening (OSS). As an EM 
algorithm, OSS possesses the monotonicity property, i.e., the sum of squares is not increased 
after an iteration. Therefore, for any sparse estimator, OSS can be used to improve its fitting 
by putting it as an initial point. By the "better fitting, better screening" rule, the screening 
performance can be improved as well. An accelerating algorithm, called fast orthogonalizing 
subset screening, is also provided. Simulations and a real example are presented to evaluate 
our methods. All proofs in this paper are presented in the Appendix. 



2 The "better fitting, better screening" rule 

When the underlying model ([1]) is actually sparse, it is desirable to screen M variables 
that include all important variables. In this section we discuss the "better fitting, better 
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screening" rule for this purpose. 

We denote any generalized inverse (Ben-Israel and Greville 2003) of a matrix A by A~. 
Note that for a submodel A, the least squares estimator (X^X_4)^X^y is not unique if 
is not of full (column) rank. We write = (X^X^)^X^y for meaning that 6 belongs to 
the set (X^X^)"X^y. For A C Zp, let (3 denote the vector with (3^ = (X^X^)"X^y 

- A 

and /S^p^^ = 0. In this section we let f3 denote the true parameter in model ([T]). We denote 
by Amax(-) and Amin(-) the largest and smallest eigenvalues of a matrix respectively. The 
notation d, %), and 2ti are defined the same as in Section [H 

Assumption 1. The random error e in (C]) follows a normal distribution A^(0, a^I), where 
I denotes the identity matrix. 

Assumption 2. There exists a constant C > such that Yl^=i ^Ij/^ ^ ^ /^'^ ^^?/ J ^ -^o- 

In practise, the regression matrix X is usually standardized with X^ILi ^Ij/^ ~ ^ 
j G Zp, and then Assumption [2] holds. 
Define 

— Amin(X^o\ytH_4X^o\^) , 

where = I — X^(X^X^)~X^ is the projection matrix on the subspace {x G M" : X^x = 
0}. It can be seen that 5„ measures some discrepancy between the two subspaces spanned 
respectively by the column vectors of X_4(, and X.Xp\Ao- '^^^ following assumption requires 
that 6n (with the signal ||/3||) cannot converge to zero too fast, which rules out the case of 
strong coUinearity between important variables and unimportant variables. Note that in this 
paper we focus on deterministic regression matrices. This makes our results applicable to 
designed covariates such as supersaturated designs (Wu 1993; Lin 1993). 

Assumption 3. As n ^ oo, {Sn\\(3\\)-^ = O(n^i), (5„||/3f )-i = 0{n"'^), M = Oin"'^), and 
log(p) = 0{n^^), where 7i ^ 0, i = 1, . . . , 4, 271 + 373 + 74 < 1, and 272 + 273 + 74 < 1. 

Theorem [T] shows the "better fitting, better screening" rule for variable screening, which 
means that, with probability tending to 1, a submodel that includes ^0 yields smaller sum 
of squares than any submodel that does not. 
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Theorem 1. Under AssumptionUl 0, andl^ if M ^ d, then as n ^ oo, 

P (^|max||y-X^'^||^} < min | ||y - X^"^|f }^ = 1 - O (exp(-Cin'^)) , 

where v = min{l — (271 + 873 + 74), 1 — (272 + 273 + 74)} and Ci > is a constant. 

For an M-subset T of Zp, define m{r) = {A e Zp : \A\ = M, ||y - Xf3^\\'^ < ||y - 
X/3 II }. We call T a better subset if |OT(T)| ^ |2li|, i.e., T is better (in the sense of 
model fitting) than at least |2li| subsets. The ratio of all the better subsets to all M-subsets 
is |2lo|/|Zp| = {lf^j}/{M) = *hich is increasing on M G {2d,n) for fixed p and 

d < n/2. Theorem [1] implies the following sure screening property of better subset regression. 

Theorem 2. Under the same conditions as in TheoremUl for any better subset T , we have 

P(r D A) = 1 - O {eM-Cm")) 

as n 00, where v and C\ are the same as in TheoremUl 

It is clear that the solution to ([2]) yields the best subset that belongs to the set of better 
subsets, and thus has the sure screening property by Theorem [2j 

After screening M variables by better subset regression, we can estimate the coefficients 
of the corresponding submodel by well-developed regression techniques for situations where 
the variables are fewer than the observations. It is desirable to use a regularization method 
that can improve on least squares regression in terms of variable selection and estimation 
accuracy. Such methods include the nonnegative garrote (Breiman, 1995), the lasso (Tib- 
shirani, 1996), SCAD (Fan and Li, 2001), the adaptive lasso (Zou 2006), and MCP (Zhang 
2010). Xiong (2010) presented some advantages of the nonnegative garrote in interpreta- 
tion and implement. The ridge regression-based nonnegative garrote method can have good 
performance even when the variables are highly correlated. 
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Figure 1: The solution to (|2]) with M = 1 in the case of two variables, where the circles and 
ellipses are contours of the objective function in ([2]), and j3-^^, (3^ and denote the least 
squares estimators under the full model and two submodels, respectively. On the left-hand 
side, the regression matrix is orthogonal. The solution to is /3]^, which corresponds to the 
larger component of /3Lg (see Theorem E]). This is not the case when the regression matrix 
is nonorthogonal. The right-hand side shows an example that the larger component of /^Lg 
does not correspond to the solution to ([2]). 

3 Orthogonalizing subset screening 

3.1 Orthogonalizing subset screening: an EM algorithm 

From the previous section, the better subsets with good model fitting have good asymptotical 
screening property. To obtain a better subset, in this section we consider the optimization 
problem (|2]) that yields the best subset. A new iterative algorithm, called orthogonalizing 
subset screening (OSS), is proposed for solving ([2]). Since (|2]) is an N-P hard problem, 
our algorithm cannot guarantee achieving the best subset. Fortunately, with an appealing 
monotonicity property, OSS improves the model fitting of an initial sparse estimator, and 
thus often improves its screening performance by the "better fitting, better screening" rule. 
Define 

/(/3) = ||y-X/3f, 

which is the objective function in ([2]). For a vector x = (xi, . . . , Xp)' G let U denote 
the set of the subscripts corresponding to the M largest values of |a;j|'s. Define a map 
z = (zi, . . . , -Zp)' = S'j\/(x) to be Zj = Xj for j & U and Zj = otherwise. If S'j\/(x) has 
multiple values, we take it to be an arbitrary fixed value among them. 
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Theorem 3. //X'X = cl, then (3* = S'j\/(X'y)/c is a solution to 

Figure [H shows the difference between orthogonal and nonorthogonal cases for solving ([2]) 
when p = 2 and M = 1. 

Theorem [3] inspires us to embed ([2]) into a problem with (column) orthogonal regression 
matrix. This idea, called active orthogonalization, was proposed by Xiong, Dai and Qian 
(2011) for computing penalized least squares estimators. Here we apply it to (|2]). Take a 
number c ^ Amax(X'X). Note that cl — X'X ^ 0. Let A be a matrix satisfying A'A = 
cl — X'X. Therefore 

is orthogonal. Consider the following linear model 

Ye = X,/3 + (3) 

where yc = (y', y^)' is the complete response vector including a missing part jm- Based 
on the complete model in ([3]), we can solve ([2]) by iteratively imputing y^- Let /3''°^ be an 
initial point. For A; = 0, 1, . . ., impute y^ as y^^p = A(3^''\ let yc,imp = (j', yLp)' and solve 

mm ||y,,i„p - X,/3||2 subject to ||/3||o ^ M. (4) 

Since Xc is orthogonal, the above problem has a closed-form solution by Theorem [31 This 
leads to the following iteration formula 

= Sm (c-^X'y + (I - c-^X'X)/3('^)) . (5) 

We call this algorithm orthogonalizing subset screening (OSS). In this paper, we always set 
c in ([S]) to be Amax(X'X), which can be computed by the power method (Wilkinson, 1965). 

It can be seen that the OSS algorithm is an EM algorithm (Dempster, Laird and Ru- 
bin, 1977). Assume that the complete data yc = (y',ym)' follows a normal distribution 
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N^X^cf^, I). The likelihood function is 



L{/3 I y) = (27r)-"/2 



exp 



( 



Given (3^ ' , the E-step of the EM algorithm is 



E log{L(/3|y,)} |y,/3' 



nlog(27r)/2 



y - X/3f/2 - \\Af3^''^ - A/3||V2. 



The M-step of the EM algorithm is to minimize the above expectation subject to the constrain 
||/3||o ^ M, which is equivalent to (jH). 

Unlike FS that always tracks one path, different choices of initial points make OSS very 
flexible. We can even take a point that does not satisfy the constraint ||/3||o ^ M to be 
the initial point of the OSS algorithm. Since often has many local minima, a multiple- 
initial-point scheme can be used in the OSS algorithm to obtain a relatively good solution. 
For each initial point, we conduct OSS until it converges. The solution corresponding to the 
smallest value of the objective function will be taken to be the final estimator. 

There are connections between OSS and other variable selection methods. When the 
initial point /3''°^ in ([5]) is the zero vector, the submodel selected by f3^^^ is the same as that 
selected by Fan and Lv (2008) 's sure independence screening (SIS). Unlike SIS, OSS will go 
on seeking better submodels after this iteration. When the p least squares estimators under 
one- dimensional submodels are taken to be initial points, OSS looks similar to the L2Boosting 
algorithm (Biihlmann and Hothorn, 2007; Zhao and Yu, 2007). Both of them can produce 
better fits by combining these simple regression estimators. A difference between them is 
that, L2Boosting successively enters the variables, whereas OSS keeps the same number of 
variables after each iteration as we want. Besides, OSS can give a further improvement to 
the estimator from L2Boosting by using it as an initial point. 
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3.2 Monotonicity of OSS 

Write 

/3(^+^) = Tm(/3('^ (6) 

where the map Tm is defined by ([5]). Like other EM algorithms, OSS has the monotonicity 
property, which is stated below. 

Theorem 4. For any f3 eW with ||/3||o < M, f{TM{/3)) ^ f{f3). 

By Theorem m the iterative map T in OSS can improve on any sparse estimator in terms 
of model fitting, and can keep its sparsity simultaneously. Specifically, let /3 be a sparse 
estimator with ||3||o = m. The OSS sequence {Tm\^)} reduces the residual sum of squares 
step by step. When the iterative process stops by some stopping rule in the i^th iteration, 
we take = Tm\'^) to be a new estimator of (3. It is obvious that ||/9||o = ||3||o, i.e., (3 has 
the same sparsity as f3. After the improvement process, the final estimator (3 fits the model 
better, and often has better screening performance than the initial estimator. 

OSS can also improve the model fitting of a class of sparse estimators using the multiple- 
initial-point scheme. Let 0^, . . . , 0^ he q sparse estimators of /3. Denote m = max 1 ||o, 
. . . , After conducting OSS iterations {Tm\Pj)} for all j = we use ^ 

correspondding to the smallest value of the residual sum of squares as the final estimator. 

3.3 Convergence properties of OSS 

This subsection focuses on convergence properties of OSS. By TheoremHl we can immediately 
obtain the monotonic convergence property of {/(/3^^^)}. 

Theorem 5. Let be a sequence generated by For any e W, 

converges monotonically to a limit as k ^ oo. 

Recall that the map Tm in (EI) is not continuous. Although almost all OSS sequences 
converge in our numerical studies, counterexamples exist in some special cases. By Theorem 
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[5l we can stop an OSS iteration in ([6]) when the sum of squares does not decrease numerically 
any more. 

The general tools for proving the convergence of an EM algorithm (Zangwill, 1969; Wu, 
1983) are not applicable to OSS because of the discontinuity of Tm- However, it is possible to 
obtain good convergence properties of OSS under certain conditions. The following theorem 
shows that an OSS sequence can converge to the global solution when the initial point lies in 
a neighborhood of the global solution. Recall that (E]) is a combinatorial optimization prob- 
lem. This theorem makes OSS similar to an effective algorithm for a continuous nonconvex 
optimization problem. 

Theorem 6. Let {(3^''''} be a sequence generated by Suppose that the problem ^ has a 
unique solution denoted by (3* with ||/3*||o = M. Then there exists a neighborhood D G MP 
of (3* such that, for any fi^^^ G D, (3^^^ /3* ask ^ oo. 

In practice, it is difficult to locate the neighborhood of (3* required in Theorem El How- 
ever, this theorem still provides us a direction to search (3* . When n is sufficiently large 
and the true (3 is sparse, we know that (3* is close to the true (3. Therefore, a consistent 
estimator of /3, which is obtained by a computationally inexpensive method, can be used 
as the initial point in OSS to approach (3* . Using this way, we are more likely to obtain 
better subsets with good screening performance. For example, the lasso, SCAD, and ridge 
regression estimator are consistent under some regularity conditions even when p is much 
larger than n (Biihlmann and van de Geer 2011; Fan and Lv 2011; Shao and Deng 2012). 

3.4 Fast orthogonalizing subset screening 

Like other EM algorithm, a disadvantage of OSS is its sometimes very slow convergence. 
Here we provide a method to speed up the OSS algorithm. Note that the least squares 
estimator yields the least residual sum of squares under a submodel. The iteration formula 
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([5]) can be replaced by 



0^'^ = Sm {c-'X'y + (I - c-iX'X)/3(^)) , 

A^'^ = {j e : 4>f^ ^ 0}, 

/3(^+^) = (x;,(,,X^(.,)+X;,(,,y, (7) 

where denotes the Moore-Penrose generahzed inverse. We call this algorithm fast or- 
thogonalizing subset screening (FOSS). FOSS can reduce significantly iteration times of OSS 
and has similar properties to OSS including the monotonicity property. 

4 Simulations 

4.1 Deterministic design cases 

Supersaturated designs are commonly used in screening experiments for studying large-scale 
systems. In this simulation study, the design matrix X in ([1]) is taken as supersaturated 
designs, and the coefficients are given by /3i = • • • = /Ss = 1 and f3j = for other j. The 
supersaturated designs are constructed from the Kronecker tensor product of a small two- 
level supersaturated design with n = 12 and p = 66 in Wu (1993) and two mxm Hadamard 
matrices (Agaian 1985). Here m = 2 and m = 4 are considered. Therefore we have two 
configurations of n and p: n = 24, p = 132 and n = 48, p = 264. 

The following four methods are considered as basic methods for comparisons: Efron et al. 
(2004) 's least angle regression (LAR), Fan and Lv (2008) 's SIS and iterative SIS (ISIS), and 
FS. Besides their popularity in variable screening, the reason why we choose them is that 
the number of variables selected by them can be exactly controlled to be a specified number. 
Hence, we can compare them with our FOSS algorithm at the same size of submodels. 
Corresponding to the basic methods, four FOSS type algorithms are used in our simulations, 
which are denoted by FOSS-LAR, FOSS-SIS, FOSS-ISIS, and FOSS-FS, respectively. After 
LAR, SIS, and ISIS select M-dimensional submodels, FOSS-LAR, FOSS-SIS, and FOSS- 
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Table 1: Simulat 


ion result 


;s in Section 


|4.1|(M 


= 10) 


Method 


n = 


24 


n = 


48 




CR 


AO 


CR 


AO 


LAR 

FOSS-LAR 


0.167 
0.277 


38.15 
16.67 


0.306 
0.795 


69.62 
36.15 


SIS 

FOSS-SIS 


0.013 
0.080 


19.78 
13.28 


0.827 
0.947 


37.52 
30.96 


ISIS 

FOSS-ISIS 


0.028 
0.038 


10.54 
9.955 


0.974 
0.974 


26.17 
25.44 


FS 

FOSS-FS 


0.192 
0.192 


3.115 
3.096 


0.982 
0.982 


16.61 
16.57 



ISIS respectively use the least squares estimators under the corresponding submodels as 
initial points in the FOSS iteration ([7]), and derive new M— dimensional submodels. To 
obtain better local solutions to (EJ, we use the multiple-initial-point scheme in FOSS-FS. 
The initial points are set as the least squares estimators under L-dimensional submodels 
selected by FS with L from M — [p/10] to min{M + [p/10], n}, where [■] denotes the floor 
function. In this subsection, M is fixed as 10. 

We use 1000 repetition in the simulations. There are two criteria to evaluate the eight 
methods: coverage rate (CR) and average objective values (AO), which denote the percentage 
of a method that includes the true submodel and average value of the sums of squares over 
the 1000 repetition, respectively. The simulation results are presented in Table [1] It can be 
seen that, FOSS cannot only reduce the sum of squares, but also yield local solutions to ([2]) 
with better, or at least the same, screening performance. In particular, the local solutions 
around LAR and SIS derived by FOSS significantly improve their CRs, respectively. It is 
also worthwhile noting that the results for n = 24 seem inconsistent with the "better fitting, 
better screening" rule (FOSS-LAR has the best screening performance but larger AO than 
FS and FOSS-FS). This may be due to the small n. When n = 48, we can see that the 
results follow this rule better. 
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4.2 Random design cases 

In the simulation we use the following model 

F = /3o + /3iXi + --- + /3pXp + £, (8) 

where Xi, . . . , Xp are p predictors and e ~ A^(0, 1) is noise that is independent of the 
predictors. The predictors {Xi, . . . , Xp)' is generated from a multivariate normal distri- 
bution A^(0, S) whose covariance matrix S = {o'ij)pxp has entries an = 1, i = l,...,p 
and cTij = p, i ^ j, where p = 0, 0.5, and 0.9 are considered. The coefficients are given 
by /?! = • • • = = 3 and f3j = for other j. We use two configurations of n and p, 
n = 50, p = 50 and n = 200, p = 500, which represent small sample cases and large sample 
cases, respectively. For each model, we simulate 1000 data sets. 

We compute the CRs and AOs of the same eight methods with M = 30 as in Section 
14.11 and the results are shown in Table El It is clear to see that, (i) FOSS can improve the 
four basic methods in terms of CR in most cases, especially when p = 0. (ii) When p is 
large, FOSS-LAR, FOSS-SIS, and FOSS-ISIS cannot give significant improvement in model 
fitting since there are many local solutions to (I2]). In spite of this, each of the three FOSS 
algorithms has at least the same CRs as the corresponding basic method. Unlike them, 
FOSS-FS reduces the sum of squares of FS much for all p's because of the multiple-initial- 
point scheme, (iii) The "better fitting, better screening" rule holds, especially in the large 
sample cases. FS, with the smallest sum of squares, usually has the largest CRs among the 
four basic methods, and FOSS-FS performs better than FS. 

5 A real data example 

We apply our methods to analyze some CT image data. The dataset used here for illustrating 
our methods is a part of the whole dataset in Frank and Asuncion (2010), and is also 
available from the author. The dataset was retrieved from a set of 225 CT images from a 
person. Each CT slice is described by two histograms in polar space. The first histogram 
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Table 3: 


Results in 


Section [5] 


Method 


Test 


Sum 




error 


of squares 


FS 


0.626 


3.881 


FOSS-FS 


0.472 


2.421 



has 240 components, describing the location of bone structures in the image. The second 
histogram has 144 components, describing the location of air inclusions inside of the body. 
Both histograms are concatenated to form the final feature vector. The response variable is 
relative location of an image on the axial axis, which was constructed by manually annotating 
up to 10 different distinct landmarks in each CT volume with known location. More detailed 
description of the dataset can be found in Graf et al. (2011). Among those 225 images, 200 
of them are set as the training sample and the remaining 25 of them are set to be the test 
sample. 

We use linear regression to analyze the relationship between the feature vector and the 
response. Here the sample size n = 200, which is much less than the number of variables, 
p = 384, in the feature vector. There are high correlations between the variables. We 
want to select a small part of variables to simplify the model and to improve the prediction 
accuracy. FS and FOSS-FS with M = 20 are applied here since they have showed us good 
performance in the simulation studies. After obtaining a submodel with 20 variables by FS 
or FOSS-FS, we compute the least squares estimator under the submodel. The test errors 
and numbers of selected variables corresponding to FS and FOSS-FS are shown in Table |5l 
Since the variables in the feature vector are highly correlated, the two subsets selected by 
the procedures are quite different. FOSS-FS performs better than FS in terms of test error. 

6 Discussion 

This paper extends best subset regression, a classical variable selection technique, to better 
subset regression. For a screening purpose, we do not need to find the best subset, and a 
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"better" one is enough. From the discussion in Section [2] and [3l the word "better" here 
has two- fold meaning. In theory, a subset is called "better" if it is better than at least |2li| 
other subsets in terms of model fitting. We show that this characteristic implies the sure 
screening property. In implementation, "better" lies in the monotonicity property of OSS 
and FOSS. The two algorithms can improve model fitting of any initial subset, and thus lead 
to better screening performance asymptotically. Simulation results in Section H] show that 
FOSS usually yields subsets having better screening performance than the initial estimators 
given by popular screening methods. 

A simple, maybe the simplest, algorithm for solving ([2]) is FS, which is commonly used 
in practice and has good screening property (Wang 2009). From the simulation results in 
Section |472| FS performs quite well since it has relatively small sum of squares. This indicates 
that FS deserves more attention for variable screening. FS can be used as a good "base" 
procedure. Based on it, we expect to find more satisfactory methods. For recent studies on 
FS and its modifications, we refer the reader to Zhang (2011). This paper also provides the 
FOSS-FS method that gives a further improvement on FS. 

The screening methods from better subset regression proposed in this paper can be 
modified to apply to other variable selection problems, e.g., selection of grouped variables 
(Yuan and Lin 2006; Huang et al. 2009; Zhao, Rocha and Yu 2009; Zhou and Zhu 2010), 
and more general models, e.g., the generalized linear model. We are interested in whether 
the "better fitting, better screening" rule holds for more general cases and believe that this 
is a valuable topic in the future. 
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Appendix 



Lemma 1. Let Xn be a chi-square random variable with degrees of freedom n. We have 

y(^^^<^ exp (^-I^iinil!^ for z>l (9) 



n j \ 42; 



P ( ^ ^ ^ exp {-"^—^ ) for z< 1. (10) 



and 

n ^ 7 ^ ^^^^ V 4(2 - z) 
The lemma can be proved by the Bernstein inequahty (Uspensky, 1937), and its proof is 
omitted here. 

Proof of Theorem [II We have 



p I iZ — "^^ II ^ 1 + 2r/ 1 + P f Jl^^ — II ^ 1 + 2r7 
(n — r-j-)a'^ j \ (n — r'j-)a'^ 



where rj- is the rank of Xj- and ?7 = 5„||/3|| /(4cr ). 
Note that lly - X/3^|| V^x^ ~ xl-rr- % ©, 



where C2 > is a constant. 

Next we consider y — X/3 , which can be written as 



^ £'H^£ + 2/3VXVH^s + n5„||/3f. 



Ill 



P ( "f,_^f)lf ^ 1 + 2^) ^ exp (-^^^1^) = O (exp [-C.(5„||/3f )^n]) , (12) 
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Note that e'H^e/a^ ~ Xn-rj^ ^''^^ /^^X^H^e ~ N{0,v'^), where is the rank of and 
y2 = or2^^X^^H^Xr/9^. By Assumption El 

^ a2A^a.(H^)An,ax(XVXr)||/3f ^ (rhi{X'rXr)\\(3f ^ nCMa'\\(3f. (13) 

We have 

P T ^ 1 + 

^ P ^+ \ ^l-2ri 

V ncr^ / V na'^ 



For sufficiently large n, by (fTOjl . 



P I £2!4£ < 1 _ „ W P f « 1 _ „/2) < exp I " « 



na'^ J yin — rj[)a'^ J \ 16 + 877^ 

0(exp[-C3(5„||/3f)2n]), ' (15) 



where C3 > is a constant. Denote the distribution function of the standard normal distri- 
bution by $. Since 1 — $(x) < exp(— x^/2)/x for any x > 0, by ( fT3l) . 



oi A:?.„exp(-^'"yii. (16) 



v^*..||(3|| V 

where C4 > is constant. 

Note that |2lo| ■ l^il < p^^^ Combining ([T2]), ([14]), ([T5]), and ([T6]), we have 

O (exp [-^.(^J/3||T.]) + O ^^^exp ^ ^ii^^ 



> l-p2M 
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where C5 = min{C2,C3}. By Assumption |3l we complete the proof. □ 



Proof of Theorem O We have 

P(r D A) > P ({ max lly - XpY} < {ll>- " ^^V) ' 

By Theorem [U we complete the proof. □ 

Proof of Theorem 0. Let ^ = 0i, . . . , = X'y/c be the least squares estimator under 
X'X = cl. Note that /(/3) = c\\f3 - + ||yf - ||^f . We only need to consider g{/3) = 
11/3 - Pf. For any (3 with ||/3||o ^ M, we have 



9,=o /3;=o 



This completes the proof. □ 
Proof of Theorem Note that 



Tm(/3) = argmin + || A0 - A/3f : ||(/>||o ^ M}. 



We have 



/(Tm(/3)) ^ /(/3) + II A/3 - A/3IP = /(/3), 
which completes the proof. □ 

Lemma 2. // the problem ([^ /ias a unique solution denoted by (3* , then (3* is a fixed point 
ofT, I.e., 13* = Tm{I3*). 

Proof of Lemma IB By Theorem HJ f{TM{(3*)) ^ f{f3*)- Since the minimum is unique, we 
have (3* = Tm{(3*). □ 

Proof of Theorem^ Without loss of generality, let ^* = {j G Zp : /3* 7^ 0} = {1, . . . , M}. 
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Denote B* = Zp\ A*. Define a function u on M^^ to be u{xi, . . . ,Xm) = niin{|xj| : j 
1, . . . ,M}. By Lemma El 

/3* = K = Turn = Sm\\ A A)^A ^^^^ 

Since is tfie unique solution, ( ITTll implies 

where || ■ ||oo denotes the ioo norm. 
Consider the set 

E=\^/3eW: u{c''X'^,y + (I - c-'X'^,Xa*)^^, - c-'X'^,Xt3*f3is,) 
> ||c-iX^.y - c-iX^.X^./3^. + (I - c-'X's,Xs*)f3^, \Q. 

We have 

nm - ( " " - -"^-^-)''- - -■xvx«.,3«. I ^ ^ ^ ^^^^ 

By (dH]), (3* e E. Thus, there exists 5 > such that the closed ball {(3 : ||/3 - /3*|| ^ 

5} C E. Denote = max {A„,ax (I - 0-1X^4. X^.), c-i[A„,ax(X'g,X^.X:4.Xe*)]'/'} and 
r = min{5, 5/{y/2v)}. Note that r > 0. For any (3^^^ e D = {(3 : ||/3 - /3*|| ^ r}, by ([I9]), 

11/3^-/3*11 

= ||(I - c-ix:4.X^0(/3!J* - - c-'y^A*^B'l3f,\\ 
^ i^{\\(3'^-(3%\\ + \\(3f4) 
^ v^z/||/3(°) -/3*|| 
^ 6. 
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Therefore, G E. Consider we have 

11/3(2) _/3* 1 1 
= \\{1 - c-'X'^,X^,){(3'il - (3*^,)\\ 

^ A^ax(I-c-ix:4.X^0||/3^'^-/3i- (20) 

Since Aniax(I — c^^X^.X^.) < 1, ( l20l) imphes /S^^-* G -E. By induction, we can prove that for 
k^2, p^^^ G E and 

which imphes /J'-'^^ — )■ /3* as /c — > oo. □ 
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